So, you have some data… perhaps you want to look at how your patients are responding to an exercise intervention for obesity and see if there’s a difference in compliance and weight loss success that could be related to violence near home that make it unsafe to play outside.
You’ll want to have a few packages installed. You may already have these, but if not, run the following:
install.packages("dplyr")
install.packages("leaflet")
install.packages("jsonlite")
install.packages("ggplot2")
install.packages("maptools")
install.packages("sp")
install.packages("rgdal")
install.packages("scales")
The City of Philadelphia supplies information about shootings (including officer-involved shootings) which includes data about the shooting victim and the location. Here, we’re really interested in the location of shootings over the past few years, to understand what parts of Philadelphia are more prone to this specific kind of violence.
To see more information about this dataset, please visit https://www.opendataphilly.org/dataset/shooting-victims/resource/a6240077-cbc7-46fb-b554-39417be606ee?inner_span=True.
For our purposes, we’re going to get the bare minimum of information: latitude, longitude, and shooting ID. The API endpoint is described in the link above and uses a SQL query to select only the data we care about. Because our query has spaces and other special characters, we need to “encode” it for request.
The data will come in as json, which we’ll parse.
library(jsonlite)
url <- URLencode('https://www.opendataphilly.org/api/action/datastore_search_sql?sql=SELECT _id, lat, lng from "a6240077-cbc7-46fb-b554-39417be606ee"')
shooting_data <- fromJSON(url)
We can examine the shooting data by using R’s str (structure) command:
str(shooting_data)
## List of 3
## $ help : chr "https://www.opendataphilly.org/api/3/action/help_show?name=datastore_search_sql"
## $ success: logi TRUE
## $ result :List of 3
## ..$ records:'data.frame': 2500 obs. of 3 variables:
## .. ..$ lat: chr [1:2500] "39.922968633" "40.0296637640001" "40.037804839" "39.9682945860001" ...
## .. ..$ lng: chr [1:2500] "-75.18870837" "-75.1205859089999" "-75.139286339" "-75.223324033" ...
## .. ..$ _id: int [1:2500] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ fields :'data.frame': 3 obs. of 2 variables:
## .. ..$ type: chr [1:3] "int4" "numeric" "numeric"
## .. ..$ id : chr [1:3] "_id" "lat" "lng"
## ..$ sql : chr "SELECT _id, lat, lng from \"a6240077-cbc7-46fb-b554-39417be606ee\""
Here we see that we have a nested data frame, accessible at shooting_data$result$records:
head(shooting_data$result$records, 6)
## lat lng _id
## 1 39.922968633 -75.18870837 1
## 2 40.0296637640001 -75.1205859089999 2
## 3 40.037804839 -75.139286339 3
## 4 39.9682945860001 -75.223324033 4
## 5 39.995426509 -75.126076954 5
## 6 40.0429494850001 -75.1617552229999 6
If we wanted to, we could easily create a map of these shootings, just based on latitude and longitude. Since latitude and longitude are currently in “chr” (character) format, we’ll make them numeric so that we can do math on them. We’ll create a map that’s centered on the mean latitude and longitude of all our shootings, and which is appropriately zoomed in (you might have to experiment with the zoom factor).
We’re going to add a standard road map below to show more context, using addTiles.
library(leaflet)
library(dplyr)
shootings <- shooting_data$result$records
shootings$lat <- as.numeric(shootings$lat)
shootings$lng <- as.numeric(shootings$lng)
shootings %>%
leaflet() %>%
addTiles() %>%
setView(lng = mean(as.numeric(shootings$lng), na.rm=TRUE),
lat = mean(as.numeric(shootings$lat), na.rm=TRUE), zoom = 10) %>%
addMarkers(clusterOptions = markerClusterOptions())
What’s more likely, however, is that we want to use polygon data to create a map that shows how much a particular area is affected. This is because we want to create combined data – we want to put information about our patients or research subjects along with the level of violence they are exposed to. Instead of using latitude and longitude, we’ll gather the number of shootings per Census tract, which we can then use as a proxy for violence exposure for the patients and subjects who live in that Census tract. It’s a sort of “binning”, but using the existing “bins” of Census tracts.
library(rgdal)
philadelphiaCensusTracts <- readOGR("http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson")
## OGR data source with driver: GeoJSON
## Source: "http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson", layer: "OGRGeoJSON"
## with 384 features
## It has 14 fields
Now what we’d like to do is get the shootings-per-tract data, which we can then combine with our research or clinical data to see if violence near home has any effect on our outcomes. To do this, we take the latitude and longitude of our shootings and transform them slightly so that they are understood as spatial coordinates, not just pairs of numbers. We’ll use the same map projection used in our original philadelphiaCensusTracts.
library(sp)
coordinates <- SpatialPoints(shootings[c("lng", "lat")])
proj4string(coordinates) <- proj4string(philadelphiaCensusTracts)
Let’s now apply what we know about our polygons (from philadelphiaCensusTracts) and apply that to our points. We’ll end up with a table that has one row for each shooting coordinate. Essentially, what we’re doing is taking each point, lining it up with a matching polygon, and then getting the data about that polygon, which came along with the geoJSON file we downloaded. We don’t want our NAME10 to be a factor variable, but a character field (who knows if next Census we might have something like 176.02.A, which is why we don’t use numeric).
shooting_tract_data <- over(coordinates, philadelphiaCensusTracts)
shooting_tract_data$NAME10 <- shooting_tract_data$NAME10
head(shooting_tract_data)
## OBJECTID STATEFP10 COUNTYFP10 TRACTCE10 GEOID10 NAME10
## 1 198 42 101 003600 42101003600 36
## 2 112 42 101 029000 42101029000 290
## 3 356 42 101 028200 42101028200 282
## 4 57 42 101 010300 42101010300 103
## 5 141 42 101 017602 42101017602 176.02
## 6 98 42 101 024700 42101024700 247
## NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10
## 1 Census Tract 36 G5020 S 964539 0 +39.9279255
## 2 Census Tract 290 G5020 S 818133 20001 +40.0296096
## 3 Census Tract 282 G5020 S 855812 0 +40.0348704
## 4 Census Tract 103 G5020 S 281357 0 +39.9678322
## 5 Census Tract 176.02 G5020 S 400458 0 +39.9967182
## 6 Census Tract 247 G5020 S 941266 0 +40.0417034
## INTPTLON10 LOGRECNO
## 1 -075.1920206 10377
## 2 -075.1166058 10605
## 3 -075.1403327 10596
## 4 -075.2254383 10437
## 5 -075.1305528 10507
## 6 -075.1645311 10560
We see the first few lines of the Census data for each of our shootings. For example, the first shooting in our shooting data corresponds to Census tract 36, which is in State 42 (Pennsylvania) and County 101 (Philadelphia County). We can use this to find out how many shootings take place in each Census tract.
shootings_by_census_shortname <- shooting_tract_data %>%
group_by(NAME10) %>%
summarise(num_shootings = n()) %>%
ungroup()
head(shootings_by_census_shortname)
## # A tibble: 6 x 2
## NAME10 num_shootings
## <fct> <int>
## 1 1 2
## 2 10.02 1
## 3 100 3
## 4 101 9
## 5 102 9
## 6 103 14
Don’t forget that there are some Census tracts that aren’t represented at all in our shooting_tract_data data frame, so let’s make sure we enrich it with all the tracts that aren’t included in the shooting data. We can get those by taking the data frame of our tract data, selecting the list of all the Census tracts in Philadelphia, and making sure that if they weren’t mentioned above, we add them, but with num_shootings equal to 0.
non_shooting_tracts <- philadelphiaCensusTracts@data %>%
select(NAME10) %>%
filter(!NAME10 %in% shootings_by_census_shortname$NAME10) %>%
mutate(num_shootings = 0)
head(non_shooting_tracts)
## NAME10 num_shootings
## 1 143 0
## 2 206 0
## 3 14 0
## 4 19 0
## 5 29 0
## 6 50 0
We can now combine the tracts-with-shootings and the tracts-with-no-shootings to get an overall picture of violence by census tract:
shooting_by_tract <- rbind(shootings_by_census_shortname, non_shooting_tracts)
Important aside: this data is completely fabricated. So ’appearances to any person, living or dead, are coincidental.
Let’s take a peek at our fake data:
fake_exercise_data <- read.csv("../Data/fake_exercise_data.csv", stringsAsFactors = FALSE)
head(fake_exercise_data)
## mrn census_tract daily_exercise_minutes
## 1 3755308 274.02 17
## 2 8144402 279.01 25
## 3 4819425 247.00 32
## 4 8974594 141.00 32
## 5 9478166 279.01 73
## 6 1633889 177.01 31
We now have two data frames that interest us:
Let’s combine them and take a quick look:
exercise_and_violence <- merge(x=fake_exercise_data, y = shooting_by_tract,
by.x = "census_tract", by.y = "NAME10", all = TRUE)
head(exercise_and_violence)
## census_tract mrn daily_exercise_minutes num_shootings
## 1 1 8276686 27 2
## 2 10.01 NA NA 0
## 3 10.02 4012634 41 1
## 4 10.02 4819049 86 1
## 5 10.02 6644697 77 1
## 6 100 4318947 40 3
Do we see any trends? Let’s do a quick plot.
plot(exercise_and_violence$num_shootings, exercise_and_violence$daily_exercise_minutes)
Wow, it looks like there’s a trend here! (Which makes sense, since I baked it into my fake data…)
At this point, statistically, we could do a two-sample test on the top and bottom quartile of our subjects to see if near-home violence leads to decreased exercise, or create a predictive model using regression.
We could also create a map – something that is easily understandable for policymakers and philanathropists to understand, or something you could put into a publication or on your website. Since in this case you’re talking about statisics per Census tract, you’ll want to simplify your data and find something like the mean or median amount of exercise per tract. Let’s do that now:
exercise_per_tract <- exercise_and_violence %>%
group_by(census_tract) %>%
summarise(mean_exercise = mean(daily_exercise_minutes)) %>%
ungroup()
head(exercise_per_tract)
## # A tibble: 6 x 2
## census_tract mean_exercise
## <chr> <dbl>
## 1 1 27
## 2 10.01 NA
## 3 10.02 68
## 4 100 29.5
## 5 101 66
## 6 102 NA
And we can combine our exercise by tract with our shootings by tract, for an almost-map-ready dataset:
exercise_shootings_per_tract <- merge(x=exercise_per_tract, y=shooting_by_tract,
by.x="census_tract", by.y="NAME10",
all = TRUE)
head(exercise_shootings_per_tract)
## census_tract mean_exercise num_shootings
## 1 1 27.0 2
## 2 10.01 NA 0
## 3 10.02 68.0 1
## 4 100 29.5 3
## 5 101 66.0 9
## 6 102 NA 9
*** WARNING WARNING PITFALL AHEAD! ***
First, we’ll combine exercise_shootings_per_tract with the data found in philadelphiaCensusTracts@data:
census_tracts <- philadelphiaCensusTracts@data %>% mutate(NAME10 = as.character(NAME10))
census_tracts <- merge(x=census_tracts, y=exercise_shootings_per_tract, by.x="NAME10", by.y="census_tract")
Then we’ll add our enriched data back to the geojson data, so that in addition to the fields it came with, it will now contain the exercise and shooting data we gathered. It’s important to order this data by the OBJECTID so that the correct polygon is associated with the correct data!
philadelphiaCensusTracts@data <- census_tracts[order(census_tracts$OBJECTID),]
Now, let’s create an interactive map! We’ll color the polygons by exercise amount to begin with.
exercise_palette <- colorBin("Blues", domain = philadelphiaCensusTracts$mean_exercise, bins = 5, na.color = "#808080")
interactive_map <- leaflet(philadelphiaCensusTracts) %>%
setView(lng = mean(as.numeric(shootings$lng), na.rm=TRUE),
lat = mean(as.numeric(shootings$lat), na.rm=TRUE), zoom = 11) %>%
addPolygons(
data = philadelphiaCensusTracts,
fillColor = ~exercise_palette(philadelphiaCensusTracts@data$mean_exercise),
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "white", # border color
fillOpacity = 1)
interactive_map
Now, let’s add some labels. We’ll do variable interpolation to create labels that tell what each Census tract is and the exercise and shooting data for that tract:
labels <- sprintf(
"<strong>%s</strong><br/>
Exercise in Minutes: %g <br/>
Number of Shootings: %g",
philadelphiaCensusTracts$NAMELSAD10,
philadelphiaCensusTracts$mean_exercise,
philadelphiaCensusTracts$num_shootings
) %>% lapply(htmltools::HTML)
Then we’ll create the map again, but with labels. This allows the viewer to see at a glance the violence and exercise metrics for each tract!
interactive_map <- leaflet(philadelphiaCensusTracts) %>%
setView(lng = mean(as.numeric(shootings$lng), na.rm=TRUE),
lat = mean(as.numeric(shootings$lat), na.rm=TRUE), zoom = 11) %>%
addPolygons(
fillColor = ~exercise_palette(mean_exercise),
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "white", # border color
fillOpacity = 1,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"))
interactive_map
What if we also want to show some color values for shootings, like we did above in our static maps? We’ll layer: our first layer will be a map that has all white borders and our second layer will have red borders, but they’ll only be visible for the shapes for which shootings are more than 10.
border_opacity <- as.numeric(philadelphiaCensusTracts$num_shootings >= 10)
interactive_map <- leaflet(philadelphiaCensusTracts) %>%
setView(lng = mean(as.numeric(shootings$lng), na.rm=TRUE),
lat = mean(as.numeric(shootings$lat), na.rm=TRUE), zoom = 11) %>%
addPolygons(
fillColor = ~exercise_palette(mean_exercise),
weight = 1, # border thickness
color = "white",
fillOpacity = 1) %>%
addPolygons(
fillOpacity = 0,
color = "red",
opacity = border_opacity,
weight = 2,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")
)
interactive_map